Ground-truthing and benchmarking document page segmentation
Abstract
We describe a new approach for evaluating page segmentation algorithms. Unlike techniques that rely on OCR output, our method is region-based: the segmentation output, described as a set of regions together with their types, output order etc., is matched against the pre-stored set of ground-truth regions. Misclassifications, splitting, and merging of regions are among the errors that are detected by the system. Each error is weighted individually for a particular application and a global estimate of segmentation quality is derived. The system can be customized to benchmark specific aspects of segmentation (e.g., headline detection) and according to the type of error correction that might follow (e.g., re-typing). Segmentation ground-truth files are quickly and easily generated and edited using GroundsKeeper, an X-Window based tool that allows one to view a document, manually draw regions (arbitrary polygons) on it, and specify information about each region (e.g., type, parent).
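The region-based comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes axis-aligned rectangles instead of the arbitrary polygons mentioned above, and the names `Region`, `find_errors`, `quality`, the `min_frac` overlap threshold, and the scoring formula are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical region: an axis-aligned box plus a type label."""
    x0: int
    y0: int
    x1: int
    y1: int
    kind: str


def area(r: Region) -> int:
    return (r.x1 - r.x0) * (r.y1 - r.y0)


def overlap_area(a: Region, b: Region) -> int:
    """Area of the intersection of two boxes (0 if they are disjoint)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(w, 0) * max(h, 0)


def find_errors(truth, output, min_frac=0.1):
    """Match output regions against ground truth and list detected errors.

    Two regions "match" when their overlap covers at least min_frac of the
    smaller one (an assumed threshold, not taken from the source).
    Returns (error_name, index) pairs; the index refers to the truth list,
    except for "merge", where it refers to the output list.
    """
    def matches(a, b):
        return overlap_area(a, b) >= min_frac * min(area(a), area(b))

    errors = []
    for i, t in enumerate(truth):
        hits = [o for o in output if matches(t, o)]
        if not hits:
            errors.append(("missed", i))
        elif len(hits) > 1:
            errors.append(("split", i))          # one truth region, many outputs
        elif hits[0].kind != t.kind:
            errors.append(("misclassified", i))  # right place, wrong type
    for j, o in enumerate(output):
        if sum(matches(t, o) for t in truth) > 1:
            errors.append(("merge", j))          # one output covers many truths
    return errors


def quality(errors, weights, n_truth):
    """Assumed global score: 1 minus the weighted error count per truth region."""
    penalty = sum(weights.get(name, 1.0) for name, _ in errors)
    return max(0.0, 1.0 - penalty / max(n_truth, 1))


truth = [
    Region(0, 0, 100, 20, "headline"),
    Region(0, 30, 100, 60, "text"),
    Region(0, 70, 100, 100, "text"),
]
output = [
    Region(0, 0, 50, 20, "headline"),    # headline split in two
    Region(50, 0, 100, 20, "headline"),
    Region(0, 30, 100, 100, "text"),     # two paragraphs merged
]
errors = find_errors(truth, output)
score = quality(errors, {"split": 0.5, "merge": 1.0}, len(truth))
```

Changing the weight dictionary is how such a sketch would mirror the per-application customization mentioned in the abstract: for example, a headline-detection benchmark could assign a higher weight to errors on regions of type `headline`.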
Similar resources
Benchmarking page segmentation algorithms
A method for automatically evaluating the quality of document page segmentation algorithms is introduced. Many different zoning techniques are now available, but there exists no robust method to benchmark and evaluate them reliably. Our proposed strategy is a region-based approach, in which segmentation results are compared with manually generated "ground truth files", describing all possible c...
Why Table Ground-Truthing is Hard
The principle that for every document analysis task there exists a mechanism for creating well-defined ground-truth is a widely held tenet. Past experience with standard datasets providing ground-truth for character recognition and page segmentation tasks supports this belief. In the process of attempting to evaluate several table recognition algorithms we have been developing, however, we have...
Persian Printed Document Analysis and Page Segmentation
This paper presents a hybrid method, combining low-resolution and high-resolution analysis, for Persian page segmentation. In the low-resolution stage, a pyramidal image structure is constructed for multiscale analysis, and the document image is segmented into a set of regions. In the high-resolution stage, connected-component analysis segments each region into homogeneous regions and identifyi...
On Benchmarking of Invoice Analysis Systems
An approach is presented to guide the benchmarking of invoice analysis systems, a specific, applied subclass of document analysis systems. The state of the art of benchmarking of document analysis systems is presented, based on the processing levels: Document Page Segmentation, Text Recognition, Document Classification, and Information Extraction. The restriction to invoices enables and require...
A Region-based System for the Automatic Evaluation of Page Segmentation Algorithms
A method for automatically evaluating the quality of document page segmentation algorithms is described. Page segmentation involves decomposing a page into its structural and logical units such as paragraphs, halftones, captions and tables. These units are then ordered and logically associated. These two steps are very important in a document recognition strategy. Many different techniques have ...